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ABSTRACT  We  describe  an  efficient  algorithm 

to  construct  genome  wide  haplotype  restriction  maps 
of  an  individual  by  aligning  single  molecule  DNA  frag¬ 
ments  collected  with  Optical  Mapping  technology.  Us¬ 
ing  this  algorithm  and  small  amount  of  genomic  mate¬ 
rial,  we  can  construct  the  parental  haplotypes  for  each 
diploid  chromosome  for  any  individual,  one  from  the  fa¬ 
ther  and  the  other  from  the  mother.  Since  such  haplo¬ 
type  maps  reveal  the  polymorphisms  due  to  single  nu¬ 
cleotide  differences  (SNPs)  and  small  insertions  and  dele¬ 
tions  (RFLPs),  they  are  useful  in  association  studies, 
studies  involving  genomic  instabilities  in  cancer,  and  ge¬ 
netics.  For  instance,  such  haplotype  restriction  maps  of 
individuals  in  a  population  can  be  used  in  association 
studies  to  locate  genes  responsible  for  genetics  diseases 
with  relatively  low  cost  and  high  throughput. 

If  the  underlying  problem  is  formulated  as  a  combi¬ 
natorial  optimization  problem,  it  can  be  shown  to  be 
NP-complete  (a  special  case  of  A'-population  problem). 
But  by  effectively  exploiting  the  structure  of  the  under¬ 
lying  error  processes  and  using  a  novel  analog  of  the 
Baum- Welch  algorithm  for  HMM  models,  we  devise  a 
probabilistic  algorithm  with  a  time  complexity  that  is 
linear  in  the  number  of  markers. 

The  algorithms  were  tested  by  constructing  the  first 
genome  wide  haplotype  restriction  map  of  the  microbe 
T.  Pseudoana,  as  well  as  constructing  a  haplotype  re¬ 
striction  map  of  a  120  Megabase  region  of  Human  chro¬ 
mosome  4.  The  frequency  of  false  positives  and  false 
negatives  was  estimated  using  simulated  data.  The  em¬ 
pirical  results  were  found  very  promising. 


*To  whom  correspondence  should  be  addressed.  E-mail: 
tsa@biostat.wisc.edu,  Tel:  608-3470637 


1  Introduction 

Diploid  organisms,  such  as  humans,  carry  two  mostly  similar 
copies  of  each  chromosome,  referred  to  as  haplotypes.  Vari¬ 
ations  in  a  large  population  of  haplotypes  at  specific  loci  are 
called  polymorphisms.  The  co-associations  of  these  varia¬ 
tions  across  the  loci  indices  are  of  intense  interest  in  disease 
research. 

The  polymorphic  marker  of  most  interest  has  been  the  Sin¬ 
gle  Nucleotide  Polymorphism  (SNP).  There  are  estimated  to 
be  15  million  such  markers  over  the  entire  human  genome 
(including  only  SNPs  where  the  minor  allele  occurs  at  least 
5%  of  the  time)  and  a  realistic  genome  wide  association  study 
would  require  at  least  1000  samples  to  be  tested  for  each  of 
these  15  million  SNPs.  By  taking  advantage  of  linkage  dise¬ 
quilibrium  [1,  7,  8,  17]  the  number  of  SNPs  that  need  to  be 
tested  can  be  reduced  to  about  300,000.  The  main  limitation 
of  most  SNP  based  approaches  is  that  each  SNP  is  assayed 
separately  and  hence  it  is  not  possible  to  accurately  infer  the 
exact  haplotype  map  for  a  particular  individual  (which  we 
will  refer  to  as  an  individual  haplotype  map  and  is  sometimes 
called  one-individual  haplotype  map  or  personal  haplotype 
map)  from  SNP  assays  alone.  Instead,  the  phase  is  inferred 
statistically  from  a  large  population  of  SNP  data.  These 
phasing  algorithms  use  assumptions  such  as  parsimony  in  the 
total  number  of  different  haplotypes  in  the  population,  the 
Hardy- Weinberg  equilibrium,  perfect  phylogeny  to  combina- 
torially  constrain  the  possible  haplotypes  [5,  9,  10,  12,  14] 
or  alternatively,  use  statistical  approaches  [19,  21]  (available 
as  phase  and  haplotyper  software  packages).  The  statis¬ 
tical  algorithms  have  tended  to  be  bit  more  accurate,  but 
unforgivingly  slow.  The  results  of  all  these  algorithms  are 
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referred  to  as  population  haplotype  maps  and  are  typically 
only  90-95%  accurate  in  any  particular  map  on  average  and 
would  thus  rarely  be  correct  in  any  individual  haplotype. 
The  inaccuracies  emanate  from  multiple,  complex  and  un¬ 
wieldy  sources  of  errors:  population  stratifications,  violation 
to  perfect  phylogeny  assumption  (with  hemizygous  deletions 
or  gene  conversion),  or  corrupted  data  manifesting  as  geno¬ 
type  errors.  Thus,  it  is  necessary  that  we  explore  new  di¬ 
rect  and  cost-effective  methods,  where  a  single  individual’s 
genome  is  analyzed  to  create  the  haplotype  sequences  or 
maps  without  making  any  overly  generalized  assumptions, 
using  population  wide  statistics,  or  requiring  the  availability 
of  parental  genomes  (as  in  trio-studies). 

For  a  genotyping  method  to  be  able  to  correctly  deter¬ 
mine  the  phasing  between  neighboring  polymorphic  mark¬ 
ers  in  every  individual  haplotype  map,  it  must  ultimately 
be  able  to  test  single  DNA  fragments  containing  2  or  more 
heterozygous  polymorphic  markers  in  a  single  test.  It  is  pos¬ 
sible,  of  course,  to  assemble  individual  haplotype  maps  by 
sequencing  the  individual’s  entire  genome  using  a  modified 
sequence  assembly  algorithm  [16,  24]  but  the  cost  of  doing 
this  is  prohibitive1. 

Here,  we  propose  a  direct  and  more  cost-effective  approach 
using  the  fairly  well  developed  single  molecule  technology  of 
Optical  Mapping.  We  demonstrate  the  feasibility  of  con¬ 
structing  such  an  accurate  individual  haplotype  map  of  re¬ 
striction  sites  through  pilot-scale  experiments  and  simula¬ 
tion  studies.  A  brief  description  of  the  data  generated  by 
Optical  Mapping  technology  is  given  in  the  following  sec¬ 
tion,  and  many  more  details  of  the  technology  can  be  found 
in  the  literature  (e.g.  [15,  18,  25]).  Unfortunately,  the  in¬ 
put  genomic  data  that  can  be  collected  from  a  single  DNA 
molecule  by  the  best  chemical  and  optical  methods  (such 
as  those  used  in  Optical  Mapping)  are  corrupted  by  many 
poorly  understood  noise  processes.  Thus  to  make  this  sys¬ 
tem  feasible,  the  biggest  challenge  has  been  in  developing 
accurate  Bayesian  probabilistic  models  of  errors  for  experi¬ 
ment  design  and  efficient  maximum  likelihood  algorithms  to 
achieve  accuracy  with  sufficiently  redundant  data. 

Each  individual  haplotype  map  of  restriction  sites  will  only 
detect  a  small  fraction  of  all  polymorphisms  in  the  human 
genome,  but  using  the  same  linkage  disequilibrium  assump¬ 
tion  mentioned  previously,  approximately  8  individual  hap¬ 
lotype  restriction  maps  will  contain  more  than  the  300,000 
SNPs  required  to  infer  all  other  known  polymophisms  in  the 
individual  genome.  Even  with  50  fold  data  redundancy  re¬ 
quired,  all  date  required  for  8  individual  haplotype  restric¬ 
tion  maps  can  be  collected  for  under  $1000. 

In  this  paper,  we  present  an  unambiguous  mathemati- 

1  Tli  is  cost  has  been  estimated  to  be  over  $10  million  per  individual 


cal  formulation  of  the  problem  (section  2),  combinatorial 
complexity  of  the  resulting  problem  (also  section  2),  an  effi¬ 
cient  probabilistic  algorithmic  solution  built  upon  a  detailed 
Bayesian  model  of  the  underlying  error  sources  (section  3) 
and  finally,  its  complexity  analysis  (section  4).  We  also  pro¬ 
vide  data  on  the  performance  of  the  algorithm  on  real  and 
simulated  examples  (section  5)  and  conclude  with  a  discus¬ 
sion  of  the  future  problems  (section  6). 

2  Problem  Formulation 

Our  problem  can  be  formulated  mathematically  as  follows: 
We  assume  that  all  individual  single  molecule  DNA  frag¬ 
ments  are  derived  from  a  diploid  genome  (ignoring  the  case 
of  sex  chromosomes)  with  two  copies  of  homologous  chromo¬ 
somes.  Each  DNA  fragment  is  further  mapped  by  cleavage 
with  a  restriction  enzyme  of  choice  and  imaged  by  an  imag¬ 
ing  algorithm  to  produce  an  ordered  sequence  of  “restriction 
fragment  lengths”  or  equivalently,  “restriction  sites.”  The 
variations  in  these  restriction  fragment  lengths  are  primar¬ 
ily  due  to  RFLPs  as  well  as  SNPs  at  the  restriction  sites. 
Additionally,  there  are  further  variations  introduced  by  the 
experimental  process  and  could  be  assumed  due  to:  sizing 
errors,  partial  digestion,  short  missing  restriction  fragments, 
false  cuts,  ambiguities  in  the  orientation,  optical  chimerisms, 
etc.  Thus,  the  genomes  may  be  represented  as  two  haplo¬ 
type  restriction  maps,  Hi  and  H2,  for  the  same  individual 
which  differ  only  slightly  from  a  genotype  restriction  map  H 
by  a  small  number  of  short  insertions,  deletions  and  SNPs 
that  coincide  with  restriction  sites.  All  such  maps,  H,  Hi 
and  H2 ,  are  assumed  to  be  representable  as  a  sequence  of  re¬ 
striction  sites  (e.g.  H2i,  with  indices  0  <  i  <  (IV +1),  where 
H2  q  and  H2<n+i  represent  the  chromosome  ends),  but  are 
unknown.  However,  short  DNA  fragments  of  around  500 
Kb  derived  from  such  maps,  and  further  corrupted  by  ex¬ 
perimental  noise  processes  can  be  readily  generated  at  high 
throughput  and  very  low  cost  using  a  technology  like  Opti¬ 
cal  Mapping  [15,  18,  25].  These  short  DNA  fragments  will 
be  written  as  Dj.,  with  indices  1  <  k  <  M,  where  M  is 
the  number  of  data  fragments  and  each  data  fragment  is  in 
turn  represented  as  a  sequence  of  restriction  sites  (e.g.  Dkj, 
0  <  j  <  mfc  +  1  )  and  can  be  aligned  globally  to  create 
an  estimate  of  genotype  map  H  using  algorithms  described 
previously  [3]. 

The  algorithmic  problem,  we  wish  to  study,  is  to  further 
separate  H  into  two  maps  Hi  and  H2  in  such  a  manner  that 
each  data  fragment  Dt~  is  aligned  well  to  one  haplotype  or 
other  and  that  Hi  and  H2  differ  from  H  only  by  modifica¬ 
tions  consistent  with  SNPs  or  RFLPs  polymorphisms. 

Thus,  ultimately,  this  problem  corresponds  to  a  problem  of 
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refining  a  multiple  map  alignment  in  to  two  families,  starting 
with  one  global  alignment.  A  combinatorial  generalization, 
where  the  number  of  such  families  is  arbitrarily  large  (k  >  1) 
and  the  cost  of  each  alignment  is  arbitrarily  unconstrained, 
has  been  shown  to  lead  to  computationally  infeasible  prob¬ 
lems.  See  [20]  for  the  proof  of  NP-completeness  as  well  as 
a  probabilistic  analysis  to  show  conditions  under  which  the 
problem  can  be  solved  efficiently  with  a  probability  close  to 
one.  The  key  to  an  effective  solution  of  these  problems  re¬ 
lies  on  careful  experiment  design  (e.g.,  choice  of  coverage, 
restriction  enzyme,  experimental  conditions,  etc.)  to  ensure 
conditions  under  which  a  polynomial  time  probabilistic  algo¬ 
rithm  will  work  with  high  probability  in  conjunction  with  a 
Bayesian  error  model  that  encodes  the  error  processes  prop¬ 
erly. 

To  construct  individual  haplotype  maps  from  Optical 
Mapping  data  we  use  a  mixture  hypothesis  of  pairs  of  maps 
H i  and  H2  for  each  chromosome,  corresponding  to  the  cor¬ 
rect  restriction  map  of  the  two  parental  chromosomes.  We 
first  assemble  the  data  into  a  regular  map  of  the  entire 
genome  and  use  this  assembly  to  separate  the  data  into  dis¬ 
tinct  chromosome  sets:  all  maps  from  the  same  chromosome 
belonging  to  a  pair  will  be  included  in  the  same  set.  We 
then  use  a  probabilistic  model  of  the  errors  in  the  data  to 
derive  conditional  probability  density  expressions  f(Dk\H\) 
and  f(Dk\H2),  and  apply  Bayes  rule  to  maximize  a  score 
for  the  best  alignment  with  respect  to  proposed  H 1  and  H2, 
Equation  1.: 

f(H1,H2\Du...,DM)  (1) 

cx  f(H1,H2)f(D1,...,DM\H1,H2) 

The  first  term  on  the  right  side  is  the  prior  probability  of 
H 1  and  H2  and  we  just  use  a  low  prior  probability  for  each 
polymorphism  (difference  in  H 1  vs.  H2).  For  the  conditional 
probability  term,  we  can  assume  each  map  is  a  statistically 
independent  sample  from  the  genome  and  that  the  mapping 
errors  are  drawn  from  i.i.d.  distributions  and  hence  write: 


M 


f(D1,..,,DM\H1,H2)  =  H 


[/(£*!#!)  + /(Dfc|if2)] 


(2) 


fc= 1 


The  conditional  terms  of  the  form  f(Dk\Hi)  above  can 
be  written  as  a  summation  over  all  possible  (mutually  ex¬ 
clusive)  alignments  between  the  particular  D k  and  Hi,  and 
for  each  alignment  the  probability  density  is  based  on  an 
enumeration  of  the  map  errors  in  the  alignment  and  mul¬ 
tiplying  together  the  probability  associated  with  each  error 
under  some  suitable  error  model.  The  exact  form  of  the  er¬ 
ror  models  suitable  for  Optical  Mapping  is  described  in  the 
next  section,  but  for  almost  any  error  models  used  the  sum 


of  the  probability  for  all  alignments  can  be  computed  effec¬ 
tively  using  dynamic  programming.  In  the  next  two  sec¬ 
tions,  we  will  derive  the  dynamic  programming  recurrence 
relations  for  this  problem  and  show  how  to  implement  the 
algorithm  with  attractive  computational  complexities.  Us¬ 
ing  these  alignment  algorithms,  we  will  see  how  it  is  possible 
to  quickly  search  through  the  space  of  all  haplotype  pairs  to 
find  the  most  plausible  ones  consistent  with  the  data. 

Other  methods  for  assembling  Optical  Mapping  data  for 
relatively  short  clones  into  genotype  restriction  maps  exist, 
e.g.,  ones  based  on  Markov  Chain  Monte  Carlo  search  [22]  or 
Maximum  Likelihood  [23]  or  heuristic  scoring  functions  [13]. 
However  these  methods  are  yet  to  be  scaled  to  map  whole 
genomes.  Even  for  restriction  maps  of  BAC  clones  these 
methods  produce  the  correct  map  only  50%  of  the  time  [13] 
and  have  no  way  of  signaling  when  the  method  fails. 
Bayesian  algorithms  [3,  4],  similar  to  the  ones  described 
here,  are  routinely  used  by  untrained  chemists  and  biolo¬ 
gists  to  assemble  genotype  restriction  maps  with  a  success 
rate  of  higher  than  90%,  and  produce  p- values  to  signal  fail¬ 
ure  rather  than  produce  an  incorrect  map. 


3  Algorithm 

3.1  Computing  Conditional  Probabilities 
for  a  Hypothesis 

Theorem  3.1  Consider  an  arbitrary  alignment  between  the 
data  D  and  the  hypothesis  H ,  Jth  restriction  site  of  D 
matching  the  Ith  restriction  site  of  H .  We  will  denote  this 
aligned  pair  by  J  i— >  I . 

Let  the  probability  density  of  the  unaligned  portion  on 
the  left-  and  right  end  of  such  an  alignment  be  denoted  by 
fUr{I,J)  on  the  right  end  if  J  1— >  I  is  the  rightmost  aligned 
pair,  and  fui  (I,  J)  on  the  left  end  if  J  *—>  I  is  the  leftmost 
aligned  pair. 

In  addition,  the  following  probability  density  functions  fm 
and  fa  denote  the  following: 

fm(I,P )  =  Pr[H[I..P]  is  missing  in 

the  observed  data  D\. 

fa(I,J,P,Q)  =  Pr[H[I..P]  is  an  aligned  region  but  not 
a  missing  fragment  with  respect  to 
the  observed  data  region  D[J..Q ]]. 

We  assume  that  I  <  P  and  J  <  Q. 

Then  the  following  holds: 

N  ra+1 

f(D\H)  M1’  J)f(D[J..m  +  l]|H[/..iV]  A  J  »  I). 

1=1  J= 0 
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1  2  3  45  I  P  N 

'  I  '  N  j  + 

12  J  Q  m 


Figure  1:  To  define  the  notation  required  we  consider  a  single  arbitrary  alignment  between  a  particular  data  D  and 
hypothesis  H .  Recall  that  N  is  the  number  of  restriction  sites  in  H  and  m  the  number  of  restriction  sites  in  D.  Any 
arbitrary  alignment  between  D  and  H  can  be  described  as  a  list  of  pairs  of  restriction  sites  from  H  and  D  that  describes 
which  restriction  site  from  H  is  aligned  with  which  restriction  site  from  D.  As  an  example,  Here  the  alignment  consists  of 
4  aligned  pairs  (4,  2),  (5,  2),  (I,  J)  and  (P,  Q).  Notice  that  not  all  restriction  sites  in  H  or  D  need  be  aligned.  For  example 
between  aligned  pairs  (J,  J)  and  (P,  Q)  there  is  one  misaligned  site  on  H  and  D  each,  corresponding  to  a  missing  site 
(false-negative)  and  extra-site  (false-positive)  in  D.  In  this  alignment  a  true  small  fragment  between  sites  4  and  5  in  H  are 
missing  from  D,  which  is  shown  by  aligning  both  sites  4  and  5  in  H  with  the  same  site  2  in  D.  Note  that  if  two  or  more 
consecutive  fragments  in  H  are  all  missing  in  D,  this  would  be  described  by  aligning  all  sites  for  the  missing  fragments  in 
H  with  the  same  site  in  D  (rather  than  showing  only  the  outermost  of  this  set  of  consecutive  sites  in  H  aligned  with  D , 
for  example).  The  expression  for  the  conditional  probability  density  of  any  alignment,  such  as  the  one  here,  can  be  written 
as  the  product  of  a  number  of  probability  terms  corresponding  to  the  regions  of  alignment  between  each  pair  of  aligned 
sites,  plus  one  probability  term  for  each  unaligned  region  at  the  two  ends  of  the  alignment. 


f(D[J..m  +  1}\H[I..N]  A  J  n  I) 

=  fur(I,J) 

+fm(I ,  I  +  1  )f(D[J..m  +  1]| H[I  +  1..1V]  A  J  i— >  (/  +  1)) 

N  m+1 

+  E  E 

P=I+ 1 Q=J+ 1 

/.(/,  J,  P,  Q)f([Q..m  +  1]| H[P..N]  A  Q  ^  P) 


In  particular,  if  the  intermediate  values  are  kept  in  a  DP 
table  Asuf  [J,  J] 

Asuf [/,  J]  =  f(D[J..m  +  1]| H[I..N]  A  J  i— >  I) 

then  it  is  easily  seen  that  f(D\H)  can  be  computed  exactly 
in  0{rri2N2)  time  and  0{mN)  space,  assuming  that  fm  and 
fa  are  0(1)  time  functions  and  fui  and  fur  are  O(N)  time 
functions.  □ 


In  a  later  section  we  will  see  how  to  reduce  the  complexity 
to  linear  time  when  we  only  require  an  e-approximate  value 
/ 


f(D\H)  —  e  <  f(D\H)  <  f(D\H)  —  e, 


for  the  probability  density  function  arising  in  the  context  of 
optical  mapping  as  follows: 


fm(I,I+  1)  = 

fa(I,J,P,Q)  =  A  Q-J-1Pd{l-Pd)P-I-1(l-Pv)Hp~Hl 

xG(Hp-Hr),a2(Hp-Hi)(DQ  -  Dj), 
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where 

Pd 

A 

a2h 

Pu 

Re 


reader  to  the  appendix:  for  details.  The  omitted  cases  are 
similar  but  tedious. 

We  may  obtain  H'  by 


the  digest  rate, 
the  false-positive  site  rate, 
the  Gaussian  sizing  error  variance 
for  a  fragment  of  size  h, 
the  probability  of  missing  a 
fragment  of  unit  size,  and 
the  breakage  rate  of  DNA 

(the  inverse  of  the  expected  fragment  size). 


Deleting  one  of  the  existing  restriction  sites  in  H ,  as  the 
site  may  contain  a  heterozygous  SNP; 

Adding  a  new  restriction  site  at  a  specified  location  in 
H,  symmetrical  to  the  previous  case; 


For  a  random  variable  x  following  a  Gaussian  distribution 
A /" (/x,  cr2 ) ,  the  probability  density  value  at  d  is 


3.  Increasing  or  decreasing  a  restriction  fragment  length  in 
H,  an  RFLP; 


G^{d) 


exp[—  (d  —  p)2 /2a2] 
■\f2na 


The  exact  form  of  the  functions  for  fui  and  fur  for  Optical 
Mapping  are  complicated,  but  not  very  important  in  under¬ 
standing  the  complexity  of  the  algorithm;  thus  a  detailed 
discussion  is  omitted  here,  but  can  be  seen  in  the  appendix. 
Note  that,  at  first  glance,  fui  and  fur  may  appear  to  be 
O(N)  time  functions,  but  it  is  easily  seen  that  they  permit 
0(1)  e-approximation. 

As  it  has  been  shown  elsewhere  [2],  a  good  approximate 
location  of  the  best  alignment  between  D  and  H  can  be  de¬ 
termined  in  0(1)  expected  time,  if  the  conditional  probabil¬ 
ity  density  has  been  previously  evaluated  for  a  similar  H  or 
alternatively,  through  a  geometric  hashing  algorithms.  Only 
a  0(l)-width  band  of  the  DP  table  needs  to  be  evaluated 
to  compute  an  e-approximation  f(D\H).  In  particular,  the 
band  width  of  the  DP  table  used  in  practice  is  usually  about 
A  =  8;  more  generally  for  Optical  mapping  A  is  bounded  by 


(1  -  P*)*"1  =  e, 


or 


A  =  1  + 


Mf) 

ln(l  —  Pd) ' 


With  this  approach  we  achieve  a  reduced  time  complexity  of 
0(min(m,  N))  (more  explicitly,  0(nrin(mA3,  N))). 


Consequently,  we  may  also  need  to  compute  the  first  and 
second  derivative  of  f(D\H)  relative  to  the  change  in  any 
fragment  size  in  H . 


Theorem  3.2  Consider  an  arbitrary  alignment  between  the 
data  D  and  the  hypothesis  H ,  Jth  restriction  site  of  D 
matching  the  Ith  restriction  site  of  H .  Using  the  notations 
of  the  previous  subsection,  we  write: 


Asuf[J,  J]  =  f(D[J..m.  +  1]|U[/..A]  A  J  /),  and 
Apref  [I,J]  =  f(D[O..J]\H[l..I]AJ»I). 

Then 

Asuf  [I,  J] 

—  fur  (I i  J)  +  fm(I ,  I  +  l)Asuf  [I  +  1,  J] 

N  ra+1 

+  E  E  fa(I,J,P,Q)Asui[P,Q], 

P—I- 1-1  Q=J-\- 1 


3.2  Recomputing  Conditional  Probabili¬ 
ties  for  a  Modification  to  Hypothe¬ 
sis 

We  next  consider  following  problem:  How  can  one  re¬ 
evaluate  the  conditional  probability  distribution  function, 
f(D\H'  =  p(H))  when  the  new  hypothesis,  H',  has  been  ob¬ 
tained  by  locally  changing  H  in  just  one  place  (correspond¬ 
ing  to  a  polymorphism).  There  are  three  cases  to  consider. 
We  study  one  of  the  three  cases  here  in  detail  and  refer  the 


and  similarly, 

Apref  [1 5  J\ 

=  ful  (/,  J)  +  Apref  [/  -  1 ,  J]  /m  (/  -  1 , 1) 

1-1  7-1 

+  EE  Apref  [P,Q\fa(I,J,P,Q), 

P= 1 Q- 0 

If  H  \  {Hk}  is  obtained  from  H  by  deleting  the  site  Hk, 
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then 


f(D\H\{HK}) 

=  Pr[Alignments  with  rightmost  aligned  I  <  K] 
-\-Pr[Alignments  with  leftmost  aligned  J  >  K] 

+Pr  [Alignments  with  a  fragment  spanning 
H[K  —  1..K  +  1]] 

K- 1  m+1 

=  E  E  Apref[/,J]/™(/,J) 

1=1  J= o 

AT  m+1 

+  E  E^rHfe)++Asuf[/,j] 

J=AT+1  J= 0 
m+1 

E  Apref  [K  -  1,  J]/m(A  -  1,  K  +  l)Asuf  [K  +  1,  J] 

J=0 


m+1  K—  1  AT  m+1 

+  EE  E  E  Vef+J] 

J=0  1=1  P=K+1  Q=J+1 


fa(I,J,P,Q) 
l  ~Pd 


Asuf  [P,  Q] , 


where  Hk'>  and  fur  Hk ^  are  computed  respectively  from  fui 
and  fur  by  suitable  simple  modifications. 

Then  it  is  seen  that  f(D\H\  {Hk}),  VAT  1  <  K  <  N ,  can 
be  computed  exactly  in  0(m2N2)  time  and  0(mN)  space, 
assuming  that  fm  and  fa  are  0(1)  time  functions  and  fui 
and  fur  are  0(N )  time  functions. 

Proof  Sketch: 

We  omit  the  derivation  of  the  recurrence  relation.  In  or¬ 
der  to  see  how  the  stated  computational  complexity  can  be 
achieved,  observe  that:  (1)  We  can  amortize  the  cost  of  com¬ 
puting  f(D\H  \  {Hk})  over  all  1  <  K  <  N  by  evaluating 
each  term  with  /a(/,  J,  P ,  Q),  fifHk\l,  J),  flfHk\l,  J)  just 
once.  For  example,  any  term 


Apref  [A  J\fa{:[^pdQ)  bsutlP,  Q. 1 

will  be  present  in  the  summation  of  f{D\H  \  {Hk})  for  all 
I  <  I\  <  P  and  absent  for  all  other  K. 

(2)  If  all  f{D\H  \  {Hk})  for  all  1  <  K  <  N  are  summed 
into  a  table  F[K  =  1..IV]  where  each 

F[K\  =  f{D\H  \  {Hk})  -  f{D\H  \  {HK-i}) 

except  A[l]  =  f(D\H\{Hi})),  we  just  need  to  add  this  term 
to  F[I  +  1]  and  subtract  it  from  F[P].  Then  it  is  easily  seen 
that  both  the  DP  tables  Apref[/,  J]  and  Asuf[J,  J]  and  the 
table  F[I\  =  1..IV]  can  be  computed  exactly  in  0{m2N2) 
time  and  0(mN)  space,  assuming  that  fm  and  fa  are  0(1) 
time  functions  and  fui  and  fur  are  O(N)  time  functions.  □ 

A  few  simple  remarks  are  in  order:  The  probability 
f(D\H  \  {Hk})  we  wish  to  compute  can  be  extracted  from 


table  F  by  adding  up  the  region  F[1..K],  which  can  be  done 
for  all  K  in  O(N)  time. 

If  we  only  wish  to  compute  an  e-approximation  /,  for  some 
consecutive  range  of  m  different  K  values,  one  can  compute 
these  m  probabilities  f(D\H  \  {Hk})  for  each  kind  of  mod¬ 
ification  in  0(min{m,  N))  time: 


1.  To  delete  a  restriction  site  from  H:  Total  cost  = 
0(min(mA3 ,  N))  for  total  of  m  restriction  sites. 

2.  To  change  the  size  of  a  restriction  fragment  in  H  by 
6:  Total  cost  =  0(?nm(mA3,  N))  for  changing  each  of 
(m  +  1)  restriction  fragments. 

3.  To  add  T  new  restriction  sites  in  H:  Total  cost  = 
0(mm(mA3  +TA4,  N)).  Note  that  typically  T  <  m. 


4.  To  compute  the  first  two  derivatives 


ddf(D\H) 

OF* 


d  =  1,2,  and  k  =  0, . . . ,  m 


of  f(D\H)  relative  to  each  of  ?7i+l  fragment  sizes:  Total 
cost  =  0(min(mA3 ,  N)). 


3.3  Search  Algorithm  for  Haplotypes 


The  recurrence  equations  of  the  previous  subsections  and 
the  dynamic  programming  algorithms  based  on  those  allow 
us  to  efficiently  compute  the  posterior  probability  for  a  single 
possible  pair  of  maps  H\  and  H2  and  their  modifications 


o' 

CN 

.  . 

_  H^ 

_  if<2)  _ 

The  computationally  expensive  part  of  computing  the  hap- 
lotype  map  algorithm  is  the  search  over  possible  maps  H\ 
and  H2  in  order  to  find  the  one  with  the  highest  posterior 
probability. 

Initially,  we  assume  that  a  single  genotype  map  hypothesis 
H  has  been  computed  and  it  has  been  determined  that  H 
best  matches  all  data.  The  algorithms  to  compute  such  maps 
have  been  developed  [4,  3]  and  have  been  in  use  for  more  than 
five  years.  The  speed  of  the  main  algorithm,  GenTig,  has 
been  improved  through  an  important  heuristic  stage  that  re¬ 
lies  on  geometric  hashing  to  quickly  identify  the  maps  that 
overlap,  and  can  also  be  used  in  the  context  of  haplotyp- 
ing.  The  time  complexity  of  this  geometric-hashing-stage  is 
super-linear  and  is  given  as 


M 

Th  =  0{N  +  Mp3),  where  Mjj  =  E]  mj  +  1) 

i=i 
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i.e.,  Md  is  the  total  number  of  fragments  in  the  optical  map¬ 
ping  data.  We  will  see  that  the  actual  time  for  this  stage  Th 
is  dominated  by  the  remaining  computation  involving  search 
over  possible  haplotype  pairs  H i  and  H2 ,  unless  the  genome 
we  are  dealing  with  is  much  larger  than  the  human  genome; 
see  next  subsection. 

If  our  initial  hypothesis  is  H,  then  =  H ^  =  H,  and 

(2) 

at  each  stage  H{  and  H2  must  then  be  refined  by  trying  to 
add  or  delete  restriction  sites  and  by  adjusting  the  distance 
between  restriction  sites  by  doing  a  gradient  optimization  of 
the  probability  density  of  all  maps  for  each  fragment  size. 
The  result  is  h[,+1'1  and  H2+1\ 

Note  that  at  each  hypothesis-reconrputation  step,  trying 
each  new  restriction  site  polymorphism  involves  modifying 
H\  or  H2  by  adding  or  deleting  a  restriction  site  from  Hi  (or 
Hi)  only,  while  trying  an  RFLP  involves  modifying  the  same 
interval  in  both  Hi  and  Hi  by  adding  some  Sh  to  Hi  and 
subtracting  the  same  Sh  from  H2.  In  each  case  both  possible 
“phases”  of  each  polymorphism  is  to  be  accounted  for,  re¬ 
versing  the  use  of  Hi  and  H2  above.  Since  both  phases  must 
be  tested  and  the  better  scoring  one  selected,  except  when 
adding  the  first  polymorphism  to  Hi  and  H2,  the  search 
process  can  easily  turn  in  to  2°(N\ 

Note  also  that  if  the  data  cannot  allow  the  phasing  to 
be  determined  because  there  are  no  (or  insufficient)  data 
molecules  spanning  both  polymorphisms,  both  phases  (ori¬ 
entations)  will  score  almost  the  same.  This  fact  is  also 
recorded  since  it  marks  a  break  in  the  phasing  of  polymor¬ 
phisms. 

Further  note  that  RFLP  polymorphisms  are  more  expen¬ 
sive  to  score,  since  in  addition  to  the  phasing  (whether  Hi 
or  Hi  has  the  bigger  fragment)  it  is  necessary  to  determine 
the  amount  of  the  fragment  size  difference  for  Hi  and  H2 
(the  Sh  value),  which  can  be  searched  for  in  0(1)  expected 
time,  and  the  constant  is  essentially  logarithmic  in  the  ratio 
of  the  expected  fragment  length  to  the  resolution  of  optical 
mapping.  More  precisely,  this  step  involves  trying  a  num¬ 
ber  of  different  multiples  of  Sh  values  that  is  logarithmic  in 
the  number  of  total  possible  values  using  the  well  known  uni- 
modal  function  maximization  algorithm  based  on  the  golden 
mean  ratio.  As  an  example,  the  total  number  of  Sh  values  re¬ 
quired  for  any  fragment  can  be  bounded  by  about  20  if  the 
resolution  of  Sh  is  set  at  0.1Kb  and  the  largest  restriction 
fragment  length  is  50Kb;  usually,  this  number  is  extremely 
small:  just  1  or  2  small  Sh  values  are  sufficient  to  verify  that 
no  polymorphism  exists. 

A  purely  greedy  addition  of  polymorphisms  to  Hi  and  H2 
is  not  sufficient  to  get  the  phases  correct  as  the  search  can 
get  stuck  in  local  maxima  when  two  or  more  polymorphisms 
are  nearby.  We  avoid  this  problem  by  using  a  heuristic  look 


ahead  distance  of  w  restriction  sites,  and  scoring  all  combi¬ 
nations  of  polymorphisms  in  this  window,  before  committing 
the  best  scoring  set  of  polymorphisms  in  Hi  and  Hi.  With  a 
sufficiently  large  window  size  w,  the  fraction  of  the  polymor¬ 
phic  sites  the  algorithm  misses  or  phases  incorrectly  can  be 
made  negligible.  Since  this  heuristic  can  increase  the  worst 
case  complexity  of  the  algorithm  exponentially  with  the  win¬ 
dow  size  w  we  heuristically  determine  the  smallest  possible 
window  w  by  using  simulated  data  and  search  the  space  of 
possible  polymorphisms  within  a  window  by  adding/deleting 
just  one  or  two  polymorphisms  at  a  time  until  no  further  im¬ 
provement  in  the  probability  density  occurs. 

The  overall  algorithm  must  try  every  possible  restriction 
site  and  fragment  as  a  possible  polymorphic  SNP  or  RFLP 
respectively  using  a  rolling  window  of  size  w  restriction  sites. 
This  process  must  be  repeated  a  few  times  until  no  further 
polymorphisms  are  detected.  Typically  just  two  to  three 
iterations  of  scanning  all  restriction  sites  suffice. 


3.4  Complexity  and  Algorithm  improve¬ 
ments 

The  overall  complexity  of  the  basic  haplotype  search  algo¬ 
rithms  described  here,  just  using  the  basic  DP  algorithm 
from  Theorem  3.1,  is 


Time  Complexity 

M 

=  y^Time  to  compute  f(Dj\H) 

3= 1 

3N  M 

+  y^y^Time  to  compute  f(Dj\H<ll>) 

i=i  j= 1 

(*)> 


and  f(Dj\H^) 


M 


M 


o  EraJ+JV|  +3iVxO  IE  rrij  +  N 
V= 1  > 

o(M2D/c  +  N), 


u=1 


where  C  =  coverage  and  C  =  (1/./V)  Y^jLi  mj 

A  couple  of  simple  tricks  can  be  used  to  significantly  speed 
up  the  evaluation  of  conditional  probabilities.  First  Hi  and 
Hi  are  typically  only  being  modified  in  a  single  location 
at  a  time.  If  a  data  map  Dj  did  not  previously  overlap 
Hi  or  Hi  anywhere  near  the  location  we  are  modifying,  we 
can  simply  reuse  its  previous  conditional  probability  density 
values  f(Dj  \Hi)  and  f(Dj  \H2).  Since  the  average  number  of 
DNA  fragments  overlapping  any  point  of  the  genome  is  C,  a 
number  considerably  less  than  the  total  number  of  fragments 
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M,  this  makes  the  total  cost  of  the  search  for  the  best  Hi 
and  H2  asymptotically  0(Mr>m). 

The  algorithm  can  be  improved  further  if  we  use  the  dual 
DP  tables  from  Theorem  3.2,  in  which  m  consecutive  changes 
to  H\  and  H2  can  be  tested  in  just  three  times  the  time  it 
previously  took  to  test  just  one  change  to  Hi  and  H2.  In 
this  case  we  will  recompute  the  dual  DP  algorithm  for  the 
approximately  2 C  DNA  fragments  at  a  time,  where  these  are 
just  the  fragments  that  overlap  m  consecutive  sites.  Hence 
the  speedup  from  switching  to  the  dual  DP  tables  is  ap¬ 
proximately  m/6  resulting  in  an  asymptotic  complexity  of 
0(Md) 

4  Empirical  Results 

4.1  Haplotype  mapping  of  22  chromo¬ 
somes  OF  T.  Pseudoana 

Optical  Mapping  data  was  previously  collected  by  Dr.  Shiguo 
Zhou  of  University  of  Wisconsin  in  order  to  assemble  a  nor¬ 
mal  Nlrel  restriction  map  of  the  microorganism  T.  Pseu¬ 
doana  (Diatom).  To  test  the  haplotyping  algorithms  de¬ 
scribed  above,  we  selected  the  22  largest  chromosomes  (out 
of  25)  and  separated  out  the  data  for  each  chromosome  in 
order  to  be  able  to  run  the  haplotyping  algorithm  on  22  sepa¬ 
rate  machines.  For  all  except  chromosome  19,  the  algorithm 
was  able  to  successfully  phase  all  polymorphisms  and  gener¬ 
ate  two  separate  maps.  Chromosome  19  has  all  of  its  poly¬ 
morphisms  near  the  two  ends  of  the  chromosome  and  there 
were  not  enough  molecules  that  spanned  the  chromosome 
end  to  end  to  allow  their  relative  phase  to  be  determined. 

4.2  Haplotype  mapping  of  120Mbase  re¬ 
gion  OF  HUMAN  GENOME 

We  also  selected  a  subset  of  the  Human  Optical  Mapping 
data  collected  by  Mr.  Alex  Lim  of  University  of  Wisconsin 
to  assemble  the  first  genome  wide  Swal  restriction  map  of 
the  Human  genome.  The  current  data  set  provides  an  aver¬ 
age  of  12  x  data  redundancy  over  the  entire  human  genome, 
which  is  insufficient  to  reliably  assemble  a  genome  wide  map. 
Moreover  the  typical  molecule  size  of  500Kb  was  shorter 
than  assumed  by  our  simulation.  However  we  selected  data 
that  had  assembled  into  a  120Mbase  contig  and  was  iden¬ 
tified  as  part  of  chromosome  4  on  the  basis  of  alignment 
with  sequence  published  by  NIH.  Even  though  12  x  data 
redundancy  is  only  sufficient  to  assemble  a  haplotype  map 
with  low  reliably  (see  next  section)  chromosome  4  is  known 
to  have  60%  more  Swal  restriction  sites  than  the  rest  of 
the  human  genome  which,  increases  the  likelihood  that  the 


typical  500Kb  molecule  in  the  data  can  span  two  to  three 
restriction  site  polymorphisms.  Therefore  we  attempted  to 
reassemble  the  data  using  the  haplotype  assembly  algorithm. 
The  program  found  233  restriction  site  polymorphisms  and 
12  fragment  length  polymorphisms,  and  was  able  to  phase 
all  polymorphisms  into  2  contiguous  regions.  Using  simula¬ 
tion  studies  (see  next  section),  we  have  determined  that  one 
needs  to  wait  until  about  50  x  data  redundancy  is  available 
when  a  reliable  map  can  be  constructed  for  this  region  as 
well  as  the  rest  of  the  human  genome. 

4.3  Simulated  Data 

With  real  data  it  is  not  easy  to  determine  what  fraction  of 
false  positive  or  false  negative  polymorphisms  is  present  in 
the  final  map  since  the  true  haplotype  map  is  not  known 
independently  of  our  method.  To  check  if  this  is  a  prob¬ 
lem,  we  generated  simulated  data  approximating  the  first  5 
mega  bases  of  human  chromosome  21,  and  the  typical  error 
rates  for  Optical  Mapping  data  based  on  maximum  likeli¬ 
hood  estimates  from  previous  microbial  maps  [15,  18,  25] 
(The  more  recent  data  used  in  the  previous  sections  had 
maximum  likelihood  error  estimates  about  30%  better,  so 
the  estimates  used  here  are  conservative).  The  simulated 
data  was  assembled  using  different  amounts  of  simulated 
data  corresponding  to  data  redundancy  of  6x,  12x,  16x, 
24x,  50x  and  lOOx  (per  haplotype).  The  results  are  sum¬ 
marized  in  Table  1.  To  understand  these  numbers  consider 
row  4  corresponding  to  16 x  redundancy).  The  last  column 
shows  that  we  used  80  molecules  in  the  simulation.  Of  these 
80  molecules  the  software  classified  71  molecules  into  one 
of  the  two  haplotype  variants.  In  fact  the  software  made  2 
errors  and  correctly  classified  only  69  molecules.  By  com¬ 
paring  the  two  consensus  maps  generated  by  the  software 
we  created  a  list  of  restriction  sites  classified  as  polymor¬ 
phic  (i.e.  a  SNP  was  claimed  by  the  software  to  exist  at 
a  restriction  site)  and  this  list  was  then  compared  with  the 
correct  list  of  SNPs  generated  from  the  true  in-silico  maps. 
The  column  with  the  header  “fp  SNPs”  shows  the  number 
of  false-positive  SNPs  claimed  by  the  software.  The  column 
with  the  header  “fn  SNPs”  shows  the  corresponding  number 
of  false-negative  SNPs.  Similarly  for  RFLPs  (i.e.  fragment 
size  polymorphisms  due  to  the  simulated  insertions/deletions 
of  3Kb). 

5  Discussions  and  Future  Work 

Single  molecule  mapping  technologies,  such  as  Optical  Map¬ 
ping,  are  ideal  for  detecting  genetic  markers  with  phasing  in¬ 
formation  and  without  population-based  assumptions.  We 
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Figure  2:  Haplotyping  algorithm  performance  for  16  SNPs  and  24  RFLPs. 


formulated  an  abstraction  of  the  haplotype  map  assembly 
problem  for  all  such  single  molecule  mapping  technologies 
and  provide  a  probabilistic  linear  time  algorithm  to  assem¬ 
ble  haplotype  maps  by  combining  single  molecule  restriction 
maps  of  long  genomic  DNAs  of  average  length  at  least  500Kb 
containing  2  or  more  heterozygous  polymorphic  restriction 
sites  on  average. 

Single  molecule  mapping  technologies  have  many  advan¬ 
tages  over  SNP  based  approaches;  we  enumerate  four.  First, 
restriction  maps  can  reveal  not  only  SNPs  that  coincide 
with  the  restriction  sites,  but  also  micro-insertions  and  dele¬ 
tions,  global  rearrangements  or  hemizygous  deletions.  Even 
though  there  is  only  about  1  micro-insertion  or  deletion  for 
every  12  SNPs,  the  average  size  of  a  micro-insertion  or  dele¬ 
tion  is  over  36  base  pairs,  and  hence  accounts  for  over  75%  of 
all  base  pair  differences  vs.  just  25%  for  SNPs  [6].  Second, 
since  single  molecule  methods  work  directly  with  genomic 
DNA  and  do  not  require  the  use  of  PCR,  with  such  single¬ 
molecule  methods  one  can  identify  markers  in  repeat  regions, 
segmental  duplications,  SINES,  LINES  etc.-  regions  occu¬ 
pying  almost  half  of  the  human  genome.  Third,  since  sin¬ 
gle  DNA  molecule  segments  are  mapped  using  fluorescent 
microscopy,  this  approach  is  capable  of  very  high  through¬ 
put  (limited  primarily  by  the  digital  camera  throughput) 
requiring  very  little  DNA,  and  costs  a  fraction  of  the  com¬ 
parable  cost  for  the  least  expensive  SNP  based  approaches. 
Of  course  such  a  individual  haplotype  map  will  reveal  only 
those  polymorphic  markers  (including  SNPs)  that  coincide 
with  restriction  sites,  but  this  can  be  overcome  by  collect¬ 
ing  maps  for  multiple  restriction  enzymes:  Based  on  the 
known  SNPs  and  extrapolating  to  an  estimated  minimum 
10  million  SNPs,  then  using  4  restriction  enzymes  with  av¬ 
erage  fragment  sizes  of  2-4Kb  it  is  possible  to  detect  approx¬ 
imately  200,000  SNPs  plus  an  estimated  130,000  other  poly¬ 
morphisms.  This  can  be  extended  to  about  1  million  SNPs 
and  650,000  other  polymorphisms  by  using  about  50  methy- 
lation  insensitive  restriction  enzymes  in  20  groups.  Finally, 
by  suitable  choice  of  restriction  enzymes,  Optical  Mapping  is 
also  capable  of  detecting  the  epigenetic  state  in  the  form  of 


methylated  CpG  bases  which  are  resistant  to  many  restric¬ 
tion  enzymes.  Epigenetic  information  is  known  to  play  a  key 
role  in  genetic  diseases  like  cancer  and  explains  why  identical 
twins  may  display  different  genetic  traits  even  though  they 
share  the  same  genetic  code. 

We  estimate  that  our  approach  is  currently  the  only  ap¬ 
proach  that  can  produce  a  genome  wide  individual  haplotype 
map  for  under  $1000  (based  on  8  restriction  enzyme  haplo¬ 
type  maps).  The  dominant  SNP  based  approach  requires 
testing  of  about  300,000  SNPs  which  costs  at  least  ten  times 
more  per  person. 

Our  approach  can  be  applied  to  other  single  molecule  map¬ 
ping  technologies.  When  applied  to  single  molecule  tech¬ 
nologies  to  map  short  6-8bp  LNA  hybridization  probes,  it 
can  be  used  to  sequence  the  entire  human  genome:  With 
50  x  coverage  the  location  of  probes  can  be  determined  to 
within  about  200bp.  Hence  well  known  error  tolerant  SBH 
(Sequencing  by  Hybridization)  algorithms  [11]  can  be  used 
to  determine  the  sequence  within  any  200bp  window  from 
maps  of  a  universal  set  of  about  2048  probes  of  6bp,  al¬ 
lowing  a  draft  quality  individual  haplotype  sequence  to  be 
assembled  for  about  $20,000. 

Software 

HapTig  software  will  be  available  soon  at 

http : //www . bioinformatics . cims . nyu . edu/~mishra 
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APPENDIX 

The  goal  of  the  various  DP  programming  formulations  in  this 
paper  have  been  to  compute  f(D\H),  or  f(D\H')  when  H  is 
perturbed  to  yield  H’ ,  thus  yielding  the  conditional  probabil¬ 
ities  of  a  data  map  D  given  a  hypothesized  consensus  map 
H.  This  conditional  term  can  be  written  as  a  summation 
over  all  possible  (mutually  exclusive)  alignments  between  the 
particular  D  and  H ,  and  for  each  alignment  the  probabil¬ 
ity  density  is  based  on  a  straightforward  enumeration  of  the 
map  errors  implied  by  the  alignment.  The  key  to  reasonably 
fast  evaluation  of  the  probability  densities  summed  over  all 
alignments  is  the  use  of  a  dynamic  programming  recurrence 
equation,  which  is  equivalent  to  factoring  out  the  common 
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sub-expressions  of  the  probability  densities  across  the  differ¬ 
ent  alignments.  First  consider  a  single  arbitrary  alignment 
between  a  particular  D  and  H .  The  data  map  D  can  be 
described  by  a  vector  of  locations  of  restriction  sites 


D  = 


m+1 

5 

J= 0 


where  for  convenience  the  first  entry  D  [0]  is  always  0  and  the 
last  entry  D[to+1]  is  the  total  size  of  the  map.  For  notational 
convenience  we  may  also  refer  to  the  entries  of  this  array  as 
D[J ]  or  its  subarrays  as  D[J..Q ],  where  0  <  J  <  Q  <  m  +  1. 
Similarly  the  hypothesis  map  H  will  be  described  by  a  vector 
of  restriction  sites 


in  the  corresponding  region  of  the  alignment  between  D  and 
H. 


fa(I,  J,  P,  Q)  =  \Q-J-1Pd(l-Pd)P~I-1(l-Pv)Hp-Hl 

-  Dj). 

Similarly  for  an  aligned  region  that  corresponds  to  a  con¬ 
secutive  number  of  missing  fragments  the  probability  den¬ 
sity  will  be  denoted  by  a  function  fm(I,P)  (e.g.,  (I,  J)  and 
(I  +  1,  J)  will  correspond  to  /m(/,  /  +  1)). 

fm(I,P)  =  P?p~Hl. 


H  = 


with  analogous  representations  of  H[I]  and  H[I..P]  with 
0<I<P<N+1.  An  arbitrary  alignment  can  be  de¬ 
scribed  as  a  list  of  pairs  of  restriction  sites  from  H  and  D 
that  describe  which  restriction  site  from  H  is  aligned  with 
which  restriction  site  from  D.  An  example  appears  in  Fig¬ 
ure  1. 

The  expression  for  the  conditional  probability  density  of 
any  alignment  as  defined  in  Figure  1.  can  be  written  as  the 
product  of  a  term  corresponding  to  the  region  of  alignment 
between  each  pair  of  aligned  sites,  plus  one  term  for  the 
unaligned  region  at  each  end  of  the  alignment. 

Let 

Pd  =  the  digest  rate, 

A  =  the  false-positive  site  rate, 
a2h  =  the  Gaussian  sizing  error  variance 
for  a  fragment  of  size  h, 

Pu  =  the  probability  of  missing  a 
fragment  of  unit  size,  and 
Re  =  the  breakage  rate  of  DNA 

(the  inverse  of  the  expected  fragment  size). 


Finally  for  the  probability  density  of  the  unaligned  portion 
on  the  left  and  right  end  of  each  alignment,  we  shall  use 
fur(I,J)  on  the  right  end  if  (/,  J)  is  the  rightmost  aligned 
pair,  and  )  on  the  left  end  if  (/,  J)  is  the  leftmost 

aligned  pair.  These  in  turn  are  computed  in  terms  of  the 
auxiliary  functions  fr  and  /): 


fr(I,J,P,Q) 

=  Xm~J(l  —  Pd)p_I_1(l  —  PPp~Hl) 

R.e  +  (Drn+1  -  Dj ,  HP  -  H,.  HP  -  Hq ) 

+y-P=N+lG(HN+l~HI),cr2(HN+1-HI)(Dm+l  -  Dj) 


fi{i,J,P,Q) 

=  AJ_1(1  -  Pd)7-p-1(l  —  PPl~Hp) 
Re^(Dj,  Ht  —  Hp,  Hq  —  Hp) 


The  function  is  defined  as  follows2: 


For  a  random  variable  x  following  a  Gaussian  distribution 
A f(n,a2),  the  probability  density  value  at  d  is 


G^Ad) 


exp[—  (d  —  ^l)2 /2a2] 
\[2/ko 


For  an  aligned  region  that  is  not  a  missing  fragment  (e.g., 
alignments  (I,J)  and  ( P,Q ),  such  that  P  >  I  and  Q  >  J) 
this  probability  density  will  be  denoted  by  a  function  of  the 
form  /„(J,  J,  P,  Q),  which  will  depend  on  the  specific  errors 


+(d,  h,  b) 

1  ~ 

2 


erf 


-  erf 


d  —  h  +  b 


\j2a2[(h  —  b)  A  (d  V  h)]  J 

( _ h-d 

l  i/2er2[(/i  —  b)  A  {d  V  h)\  J  J 


2Notations:  a  A  b  =  max(a,  b)  and  a  V  b  =  min(a,  b). 
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Now  the  functions  fur  and  fui  are  defined  as  follows: 


{  Ep=;+i  Mi-  A  r\  P  -  1), 


if  J  <  to; 


=  < 


pHN+1-HN  +  R 


pH  N +1  H 


log  Pv 
ii  I  =  N  &  J  =  to  +  1; 

otherwise; 


and 


fm(RJ) 


{  Ep=U(7++++1), 


=  < 


P?1  +  R, 


if  J  >  1; 

p^1  _  i 

// _ _ 

log 

if  I  =  1  &  J  =  0; 
otherwise. 


Now  all  that  remains  is  to  write  down  the  probability  den¬ 
sity  of  a  particular  alignment  as  simply  the  product  of  each 
of  the  terms  fa,  fm,  fui  and  fur  while  computing  Asuf  re¬ 
cursively  as  explained  in  the  body.  Note  that  the  proba¬ 
bility  density  of  any  alignment  can  be  broken  apart  into 
the  product  of  those  terms  on  either  side  of  any  particu¬ 
lar  alignment  pair  (/,  J) .  This  observation  forms  the  basis 
of  a  two-dimensional  recurrence  using  the  array  Asuf  [/,  J] , 
where  1  <  I  <  N  and  0  <  J  <  m  +  1 . 


Asuf  [/,  J] 

=  fur(EJ) 

+1  I<N  fm(E  I  +  1) Asuf  [7  +  1,  J ] 

N  ra+1 

+  E  E  fail,  J,P,Q) Asuf [P,Q] 

P=I+ 1 Q=J+ 1 

f(D\H) 

N  ra+1 

=  EE-W7’J)  Asuf  [P  4 

1=1  J= 0 

Next  we  write  down  the  recurrence  equations  as  we  modify 
H  to  H'. 

We  start  with  an  explanation  for  how  to  efficiently  re¬ 
compute  f(D\H')  while  deleting  one  restriction  site  Hk  from 
H  at  a  time  for  all  possible  K,  1  <  K  <  N.  The  key  step 
is  to  compute  an  additional  recurrence  array  Apref  which 
represents  the  sum  of  the  probability  densities  of  all  those 
alignments  between  the  part  of  H  to  the  left  of  site  I  and  the 
part  of  D  to  the  left  of  site  J,  for  which  (/,  J )  is  the  right¬ 
most  aligned  pair.  The  corresponding  recurrence  equation 


are  shown  below: 


Apref  [I,  J] 

=  ful(I,J) 

+I/>oApref[/  —  1,  ~  1,  I) 

I-  1  J-l 

+  E  E  Apref[PQ]/a(/,J,P,Q) 

P= l Q=0 

fk(D\H) 

K  —  l  ra+1  N  ra+1 

=  EEVef  lI,J]furlI,J}+  E/“i[/’J]A-f[7’J] 

1=1  J= 0  I=K+ 1  J= 0 

ra+1 

+  E 

j= o 

IfC  <  N  Apref  \K  -  1,  J\fm(K  -  1,  K  +  l)Asuf  [K  +  1,  J } 

K  —  l  N  ra+1  1 

+  E  E  E  Apref  [I,  J]fa  (I,  J,  P,Q)  Asuf  [P,Q]  • 

1=1  P=K+1Q=J+1 


Note  that  in  equation  above,  none  of  the  terms  Apref  [/,  J] 
or  Asuf  [/,  J]  will  change  if  we  remove  the  restriction  site  Hk 
from  H.  However  the  terms  fa(I,J,P,Q)  will  change  to 
fa(I,  J i  P,  Q)/(l~Pd),  and  fur[I,  J ]  and  /„;[/,  J]  will  change 
in  a  way  we  will  describe  below,  when  Hk  is  deleted.  First 
we  rewrite  the  previous  equation  as 


fk(D\H) 

K-lm+l  N+l 

=  E  E  Ae-f I7-  A  E  J,P,P-  i) 

1=1  J= o  P=I+ 1 

N  ra+1  J-l 

+  E  E  Asuf+JiE^7’7’73’7^1) 

I=K+ 1  J= 0  P= 0 

ra+1 

+  E 

j= o 

Ix<iV  Apref  [K  -  1,  J]fm(K  -  1,  K  +  l)Asuf[A'  +  1,  J] 

K  —  l  N  ra+1 

+  E  E  E  Apref  J]  fa  (I,  J,P,Q)  Asuf  [P,Q] 

1=1  P=K+ 1  Q=J+1 


We  now  write  down  the  recurrence  equation  to  reflect  the 
deletion  of  Hk  from  H  and  corresponding  changes  in  fa ,  fi 


REFERENCES 


13 


and  fr  to  get  the  result: 


f(D\H\{HK}) 

K-lm+l  N  +  l 

=  EE  Apref  [I,  J]  fr(Hk>(^  7  P,  P  ~  1) 

7=1  J= o  P=7+ 1 

N  m+ 1  7-1 

+  E  E A-f[/’J]E/r("fc)(77P,p+ 1) 

I=K+ 1  J= 0  P= 0 

m+1 

+  E 

j= o 

[l/f<ArApref  [K  —  1,  —  1,  A'  +  l)Asuf[A'  +  1,  J] 


K- 1  JV  m+1 

+  E/  E  E  Apref  [I,  J ] 

i=l  P=if  +  1  Q  =  J+1 


fa(I,  J,  P  Q) 
l  ~Pd 


Asuf[P,  Q]  , 


where 


f~^K\K,I,J,P) 


P,P~1) 
(1  -  Pd) 

fr{P  J,  P,  P  —  2), 

0, 

fr{I,  J,P,P—  1), 


if  A  <  P  -  1; 

if  A  =  P  -  1; 
if  A  =  P; 
if  A  >  P, 


where  the  the  notations  Ap^^ffT^[A,  J]  (respectively, 

^su{K^Ht^[K  ~  7  7])  means  to  evaluate  Apref[A,  J]  (respec¬ 
tively,  Asuf[A'  —  1,  J])  using  its  defining  equation  provided 
previously  while  replacing  any  occurrence  of  Hk  with  Ht- 
Note  that  the  equation  shown  above  depends  on  the  exact 
value  of  Ht  only  in  the  last  summation  term.  Furthermore 


f+H*\K,I,J,P) 

fr(I,J,P,P-l)(l-Pd),  if  A  <  P  —  1; 

fK^T)(U  A,A-1} 

+  ^Hk-1^Ht\I,J,K,K-1),  if  K  =  P; 
fr{I,  J,P,P—  1),  if  A  >  P, 


and  analogously 
/+("K)(A,/,J,P) 

i'  fi(I,J,P,P  +  1),  if  K  <  P. 

+  fiHK~1^HT\p  J,K-  1,A),  if  A  =  P  +1; 
.  /;(/,  J,P,  P  +  1)(1  -  Pd),  if  A  <  P  —  1. 


and  analogously 


f-(HK)(K,I,J,P ) 


< 


/«(/,J,P,P  +  l), 

0, 

/;(/,  J,P,P  +  2), 
/,(J,J,P,P  +  1) 
(1  -  ft) 


if  A  <  P; 
if  A  =  P; 
if  A  =  P  +  1; 

if  A  >  P  +  1. 


Recurrence  equation  for  adding  a  restriction  site  Ht  to  H 
somewhere  between  Hk- i  and  Hk  is  shown  below. 


Recurrence  equation  for  adding  a  small  amount  5h  to  one 
restriction  fragment  [Hk,  Hk+i\-  Note  that  this  is  equiva¬ 
lent  to  changing 


(Hi, . . . ,  Hk~i,  Hk,  Hk+i,  ■  ■  ■ ,  Hk) 

-  (Hi,...,  HK-i,HK,  Hk+  i  +Sh,...,HN  +  Sh). 


f(D\H  U  {Ht},  Hk- i  <  HT  <  HK ) 

K- 1  m  N+l 

=  EEV'M  E  fr{HT\K,I,J,P) 

1=1  J= o  P=7+l 

N  m+1  7-1 

+  E  E  Asuf  +  a  E  tf(HT)  (*>  7 p) 

I=K  J=1  P= 0 

m+1  - 

Apref  [tf  -  1,  J]fm(K  -  1,  K  +  l)Asuf  [K  +  1,  J] 

J=0  - 

K-l  N  m+1 

+  E  E  E 

7=1  P=K  Q=J+1 


Apref  [J,  J]/a(/,  J,  P,  Q)(l  ~  Prf)Asuf  [P,  Q] 


+  E 


m+1 


+  E  ap lrHT\p  M^Ht)(k  - 1,  j], 


/(D|A  :  Vt>kHt  -►  At  +  (5ft) 

K  m  JV  +  1 

=  EEAp-f[77]  E  ri+Sh\K,i,j,p) 

1=1 J=0  P=I+1 

N  m+1  7-1 

+  E  EMJ.J]E/i(+,k)wj,j,p) 

7=7C+1  J=1  P=0 

m+1 

+  E 

j= o 

Ix<Af  Apref  [A,  J]P+fm(K,  A  +  l)Asuf  [A  +  1,  J] 

P-1  77  m+1 

+  EE  E  APref[/,J]/i+5h)(/,J,P,Q)Asuf[P,Q] 

7=1  P=K  Q=J+ 1 
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where 


computed  as  shown  in  the  following  formulae. 


fa+Sh\l,J,P,  Q) 

=  f(aHp^Hp+Sh\l,J,PQ ) 
ri+5h)(K,I,J,P) 

'  j-Hp-i—tHp-i+6h,Hp—*Hp+8h^j  J  p  p  _ 

if  K  <  P-  1; 

fHp^Hp+Shy,  j  p  p_  1)j 

|  if  A  =  P  -1; 

fr{I,  J,P,P~  1), 


and  analogously 


d  fm(K,K  +1)  {  {  TS  IS  I  1  \  /l  D  \d 

- —d -  =  /m(A,  A  +  l)(log  P„)  . 

fid  AN) 

-jA -  =  fm(N,  N  +  l)(Pe  +  log  Pi/)(logP^)fl 

fid  y*( 0) 

=  y"m(0,  l)(Ae  “t”  log  )  (log  )^_  1 , 

Next,  we  derive 


dfa{I,J,P,Q) 


)(!■  J,  P  Q) 


G'{Dq  —  Dj,  Hp  —  Hi)  /m(7,P)logP„ 
G'(Dq  -  Dj,  HP  -  Pj)  1  -  fm(I,  P) 


ft+dh\K,I,J,P) 


/*(7,J,P,P+1), 
if  K  <  P; 

fHp^HP-Sh^Jpp+  1)j 

if  A  =  P; 

pH  p — >Hp—5h,Hp+i — >H  p+i—Sh 
Jl 

if  K  >  P. 


(/,  J,  P,  P+  1) 


G'{Dq  —  Dj,  HP  —  Hi)  /m(7,P)logP+ 

G(Dq  —  Dj,  Hp  —  Hi)  1  P)  J 

G"(Dq  -  Dj,  Pp  -  Pj)  /  G'(Dq  -  Dj,  Pp  -  Pj) 


Finally  the  first  two  ( d  =  1,  2)  partial  derivatives  of 
f(D\H)  relative  to  all  fragment  sizes  Fk  =  Hk+ i  —  Hk, 
0  <  K  <  N,  can  be  computed  by  using  the  recurrence  equa¬ 
tion  shown  below. 

8df(D\H) 


EEmm  £ 

j=i  j=o  p=j+i  °  K 


G(Dq-Dj,Hp-Hi)  VG(Dq-Dj,Pp-Pj) 
/m(/,P)  logP2' 
l-/m(7,P)2  _  • 


where 


G(d,h)  =  eM-(d-h)2/2a2h} 


27 rcr2/l 


n  (d2-h2-a2h\ 

G{d’k)  =  f  2a2/P  J 


G(d,A) 


G"(d,  h)  = 


Furthermore, 


d2  —  h2  —  a2h]  \  2 


d2  1 
cr2ft3  +  2  A2 


XTo  G(d,A) 


/  =  K+1  J=1 


Apref  [iV ,  772+1] 


+Ijc=oAsuf  [1,  0]- 


ddf{A 


dfr{I,  J,P,P~  1) 


/r,: 0  a,  2  P)  -  fr  ’  (R  J,P ),  if  A  <  P  —  1; 
+’1)(7,J,P),  if  A  =  P  —1; 

0,  if  A  >  P, 


Ii?<jvApref  [A,  j]  ddfm  +  11  Asuf  [A  +  1,  J] 

L  Of  k 


K  N  m+1 

+ +  +  +  Apref  [7,  j] 

J=1  P=K+1 Q=J+1 


ddfa{I,  J,  P,  Q) 


Asuf  [P,  Q] 


and  analogously 
dfi(I,  J,  P,  P  —  1) 


0, 

fP\l,J,P), 


if  A  <  P, 
if  A'  =  P; 


/,(o,1)  (/,  2  P )  -  /,(M)  (^,  2  P),  if  K  >  P; 


The  differential  expressions  in  the  recurrence  equations  are 
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Next,  we  consider  the  second  derivatives: 


Analogously, 


d2/r(7,  J,P,P-  1) 

dFl 

f  fra’2\l,J,P)  —  frb’2\l,J,P), 

l  o, 


if  K  <  P  -  1; 
if  K  =  P  —  1; 
if  K  >  P, 


and  analogously 


a2/,(/,j,p,p-  i) 

dFK 

(  0,  if  K  <  P, 

=  ]  /,(o,2)(/,  J,  P),  if  K  =  P- 

{  //°'2)(7,J,P)-jf'2)(7,J,P),  if  K  >  P; 


fla^{I,J,P) 

=  A  ^(f-P,)^-1! 

(1  -fm(PI)) 

Re<fa(Dj,  H,  -  HP,  H r  -  HP+ 1) 
+Ip=oG,{Dj,HI-HP) 

- fm(P,I )  logP„ 

R^iDj,  Hj  -  HP,  H:  -  HP+ 1) 
Ip=oG{Dj,Hi  -  HP)]\ 


//“’2)(/,J,P) 


=  \J-1(l-Pd)I~p-1 

The  terms  /r°’2\  /j(a,1\  and  //“’2)  are  defined  as  Re^'a(Dj,  IE  -  HP,  IE  -  HP+1) 

shown:  -i 

+lp=0G"{Dj,HI-Hp)\ 


f<?'l\l,J,P) 

=  Am-7(l-Pd)p-/-1| 

(1  —  fm(I,  P)) 

Pe^a(P™+i  -  7Aj,Pp  -  HpHp- 1  -  Pi) 
+Ip>jvG"(Pq  -  Dj,Pp  -  Pi) 

-fm(PP)  log  Pv 

Re<f(Dm+1  -  Dj,HP  -  HpHp.!  -  IP) 

I  p>nG(Dq  —  Dj,HP  —  PE)  1 1 


/^’2)(/,J,P) 

=  Am-7(1-Pd)p-/-1 

Pe^(7)m+1  -  Pj,Pp  -  Pi,Pp- 1  -  Pi) 
+Ip>jvG"(Pq  -  Dj,HP  -  Pi)  1 


Next,  the  terms  frb,1\  frb’2\  //b,1\  and  //b’2^  are  defined 
as  shown: 

/^m)(7,J,P) 

=  Am-J(l-Pd)p-/-1(l-/m(/,P)) 

Pe\P6(Pm+i  -  Dj,Pp  -  Pi,Pp_ i  -  Pi) 

and 

/^’2)(/,J,P) 

=  Am-J(l-Pd)p-/“1 

fl«^(Om+i  -  Pj,Pp  -  Pi,Pp- 1  -  Pi) 

Analogously, 

//M)(7,J,P) 

=  AJ-1(l-Pd)/-p-1(l-/m(P,7)) 

Pe^b(Pj,  Pi  -  Pp,  Pi  -  Pp+i) 

and 

jf’2)(7,J,P) 

=  AJ-1(1  —  Pd)I~F~1 

Re<K'b(Dj,  PE  -  HP,  PE  -  Pp+i) 
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Here  the  functions 
definitions. 

Va(d,hi,h2)  = 
'Sf'a(d,hi,h2)  = 
tyb(d,hi,h2)  = 
^'b(d,hi,h2)  = 


a,  ^a,  'h b ■  and  ^’b  have  the  following 

exp[—  (d  —  hi)2 /2a2 (d  V  hi)  A  h2] 

1)  A  h2 

'I ’a{d,  hi,h2) 

(d  V  hi)  A  h2] 

1)  A  h2 

'P0(d,  hi,h2) 


y/2ira2(d  V  h 


d  —  hi 


2a2(d  V  hi)  A  h2 
exp[— (d  —  h2)2 /2a2 


\j2'Ko2(d  V  h 
d  —  h2 

2 a2(d  V  hi)  A  h2 


